iFood Midterm Report

Company Overview

Founded in 2011, iFood is Brazil's leading online food-delivery company, operating in Brazil, Colombia, and Mexico. Through the iFood mobile app, users can browse restaurants, view menus and prices, and place orders with ease, much like Meituan or Ele.me in China. iFood has more than 12 million users, works with nearly 70,000 restaurants and 120,000 couriers, and delivers in over 500 Brazilian cities.


iFood aims to build an ecosystem centered on eating: beyond food delivery, it has expanded into areas such as grocery and pharmacy delivery, and even pet supplies. Its easy-to-use app has attracted a diverse audience and given it a dominant position in Brazil's delivery market.

iFood phone app

Dataset Field Overview

Demographics

  • Age: Age
  • Income: Income
  • Number of children at home: Kidhome
  • Number of teenagers at home: Teenhome
  • Marital status:
    • Divorced: marital_Divorced
    • Married: marital_Married
    • Single: marital_Single
    • Living together: marital_Together
    • Widowed: marital_Widow
  • Education level:
    • 2n Cycle: education_2n Cycle
    • Basic: education_Basic
    • Graduation (bachelor's): education_Graduation
    • Master: education_Master
    • PhD: education_PhD
  • Days as a customer: Customer Days

Purchase Behavior

  • Days since the last purchase: Recency
  • Amount spent on wine: MntWines
  • Amount spent on fruit: MntFruits
  • Amount spent on meat products: MntMeatProducts
  • Amount spent on fish products: MntFishProducts
  • Amount spent on sweets: MntSweetProducts
  • Amount spent on gold products: MntGoldProds
  • Amount spent on regular products: MntRegularProds
  • Total amount spent: MntTotal

Campaigns and Engagement

  • Accepted campaign 1: AcceptedCmp1
  • Accepted campaign 2: AcceptedCmp2
  • Accepted campaign 3: AcceptedCmp3
  • Accepted campaign 4: AcceptedCmp4
  • Accepted campaign 5: AcceptedCmp5
  • Total campaigns accepted: AcceptedCmpOverall
  • Response to the most recent campaign: Response

Purchase and Contact Channels

  • Purchases made with a discount: NumDealsPurchases
  • Purchases made through the website: NumWebPurchases
  • Purchases made using a catalogue: NumCatalogPurchases
  • Purchases made directly in stores: NumStorePurchases
  • Website visits in the last month: NumWebVisitsMonth
  • Complaint record: Complain

Business Cost and Revenue

  • Cost of contacting a customer (constant): Z_CostContact
  • Revenue (constant): Z_Revenue

Loading the Data

In [117]:
import polars as pl
In [118]:
file_path = 'ml_project1_data.csv'
df = pl.read_csv(file_path)
In [119]:
df
Out[119]:
shape: (2_240, 29)
[Row preview omitted: the export flattened the table. Columns: ID, Year_Birth, Education, Marital_Status, Income, Kidhome, Teenhome, Dt_Customer, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth, AcceptedCmp1–5, Complain, Z_CostContact, Z_Revenue, Response.]

Data Preparation

  • Check for missing (null) values
  • Add summary columns (total spending, total purchase count)

Check for missing values.

In [120]:
df.describe()
Out[120]:
shape: (9, 30)
[df.describe() summary, flattened in export. Key points: 2,240 rows; Income has 24 nulls (all other columns have 0); Income ranges 1,730–666,666 with mean ≈ 52,247; Year_Birth ranges 1893–1996; Marital_Status contains the anomalous values "Absurd" and "YOLO"; Z_CostContact and Z_Revenue are constant at 3 and 11.]

Income has missing values, about 1% of the dataset, so we simply drop those rows.

In [121]:
df = df.drop_nulls(subset=['Income'])

Add TotalSales (total spending).

In [122]:
df = df.with_columns(
    TotalSales=pl.sum_horizontal(
        'MntWines',
        'MntFruits',
        'MntMeatProducts',
        'MntFishProducts',
        'MntSweetProducts',
        'MntGoldProds',
))
In [123]:
df
Out[123]:
shape: (2_216, 30)
[Same preview as above with the new TotalSales column appended (first row: 1617; last row: 172).]

Add TotalPurchases (total purchase count).

In [124]:
df = df.with_columns(
    TotalPurchases=pl.sum_horizontal(
        'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'
))
In [125]:
df
Out[125]:
shape: (2_216, 31)
[Same preview with the new TotalPurchases column appended (first row: 25; last row: 11).]

Exploratory Analysis

Marital status and education.

In [126]:
import altair as alt

(
    df.plot.bar(alt.X("Marital_Status", title="Marital status"), y=alt.Y("count()", title="Count"))
        | df.plot.bar(alt.X("Education", title="Education"), y=alt.Y("count()", title="Count"))
        | df.plot.bar(alt.X("Education", title="Education"), alt.Y("count()", title="Count", stack="normalize"), alt.Color("Marital_Status", title="Marital status"))
)
Out[126]:

Share of households with children or teenagers.

In [127]:
(
    df["Kidhome"]
    .value_counts()
    .plot.arc(color="Kidhome:N", theta="count")
    .properties(title="Number of kids at home")
    | df["Teenhome"]
    .value_counts()
    .plot.arc(color="Teenhome:N", theta="count")
    .properties(title="Number of teens at home")
)
Out[127]:

Number of customers who accepted each campaign.

In [128]:
df.select(
    Cmp1="AcceptedCmp1",
    Cmp2="AcceptedCmp2",
    Cmp3="AcceptedCmp3",
    Cmp4="AcceptedCmp4",
    Cmp5="AcceptedCmp5",
).sum().transpose(include_header=True, header_name="Campaigns").rename(
    {"column_0": "count"}
).plot.bar(
    alt.X("count", title="Acceptances"),
    alt.Y("Campaigns", title="Campaign"),
    alt.Color("Campaigns", legend=None),
).properties(
    title="Acceptances per campaign"
)
Out[128]:

Customer age distribution.

In [129]:
df = df.with_columns(age=2020 - pl.col("Year_Birth"))  # age as of 2020

df.plot.bar(alt.X("age", bin=True, title="Age"), y=alt.Y("count()", title="Count"))
Out[129]:

Customer Profile

  • Age: mostly over 40, with the 40–60 range the largest group.
  • Children: most households have no children; one-child households are next.
  • Marital status: mostly married, then single.
  • Education: predominantly university-educated or higher.

First-Time vs. Returning Customers

Assume that customers with more than one purchase are returning customers; otherwise they are first-time customers.

In [130]:
CustomerType = pl.Enum(["First-Time", "Returning"])

df = df.with_columns(
    pl.when(pl.col("TotalPurchases") > 1)
    .then(pl.lit("Returning"))
    .otherwise(pl.lit("First-Time"))
    .cast(CustomerType)
    .alias("CustomerType")
)
In [131]:
df
Out[131]:
shape: (2_216, 33)
[Preview with the new age and CustomerType columns appended; every displayed row is "Returning".]

Proportion of first-time vs. returning customers.

In [132]:
alt.Chart(df).transform_joinaggregate(total='count(*)').transform_calculate(
    pct='1 / datum.total'
).mark_bar().encode(
    alt.X('sum(pct):Q', axis=alt.Axis(format='%'), title="Share of total"),
    alt.Y('CustomerType', title="Customer type"),
).properties(
    title="First-time / returning share"
).interactive()
Out[132]:

Returning customers far outnumber first-time customers.

RFM Segmentation

In [133]:
# R: Recency, F: Frequency (number of purchases), M: Monetary (amount spent)
rfm_df = df.select(Recency="Recency", Frequency="TotalPurchases", Monetary="TotalSales")
In [134]:
rfm_df
Out[134]:
shape: (2_216, 3)
Recency  Frequency  Monetary
58       25         1617
38       6          27
26       21         776
26       8          53
94       19         422
…        …          …
46       18         1341
56       22         444
91       19         1241
8        23         843
40       11         172

Cut each metric into quintiles, compute the R, F, and M scores, and combine them into a customer score.

In [135]:
rfm_df = rfm_df.with_columns(
    # Recency 分數(越小越好)
    RecencyScore=pl.col("Recency").qcut(5, labels=["5", "4", "3", "2", "1"]),
    # Frequency 分數(越大越好)
    FrequencyScore=pl.col("Frequency").qcut(5, labels=["1", "2", "3", "4", "5"]),
    # Monetary 分數(越大越好)
    MonetaryScore=pl.col("Monetary").qcut(5, labels=["1", "2", "3", "4", "5"]),
)

rfm_df = rfm_df.with_columns(
    CustomerScore=pl.concat_str(
        pl.col("RecencyScore"), pl.col("FrequencyScore"), pl.col("MonetaryScore")
))
In [136]:
rfm_df
Out[136]:
shape: (2_216, 7)
Recency  Frequency  Monetary  RecencyScore  FrequencyScore  MonetaryScore  CustomerScore
58       25         1617      "3"           "5"             "5"            "355"
38       6          27        "4"           "1"             "1"            "411"
26       21         776       "4"           "4"             "4"            "444"
26       8          53        "4"           "2"             "1"            "421"
94       19         422       "1"           "4"             "3"            "143"
…        …          …         …             …               …              …
46       18         1341      "3"           "4"             "5"            "345"
56       22         444       "3"           "4"             "3"            "343"
91       19         1241      "1"           "4"             "5"            "145"
8        23         843       "5"           "5"             "4"            "554"
40       11         172       "3"           "2"             "2"            "322"
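The quintile scoring above can be sketched without polars. This is an illustrative re-implementation of what `qcut(5, labels=…)` does (ties are broken by sort position here, whereas qcut uses quantile boundaries, so results can differ on duplicated values):

```python
# Rank-based quintile scoring, standard library only (illustrative sketch).
def quintile_scores(values, reverse=False):
    """Score each value 1..5 by the quintile of its sorted position.
    With reverse=True, smaller values score higher (as for Recency)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    scores = [0] * n
    for rank, i in enumerate(order):
        q = rank * 5 // n + 1  # quintile 1..5
        scores[i] = 6 - q if reverse else q
    return scores

# Monetary: larger spend earns a higher score.
print(quintile_scores([27, 53, 172, 422, 444, 776, 843, 1241, 1341, 1617]))
# -> [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
# Recency: a more recent purchase (smaller value) earns a higher score.
print(quintile_scores([8, 26, 26, 38, 40, 46, 56, 58, 91, 94], reverse=True))
# -> [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]
```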

Select the most valuable group.

In [137]:
best_customers = rfm_df.filter(pl.col("CustomerScore") == "555")
In [138]:
best_customers
Out[138]:
shape: (22, 7)
Recency  Frequency  Monetary  RecencyScore  FrequencyScore  MonetaryScore  CustomerScore
2        29         1693      "5"           "5"             "5"            "555"
19       32         1274      "5"           "5"             "5"            "555"
14       24         1581      "5"           "5"             "5"            "555"
17       29         1804      "5"           "5"             "5"            "555"
1        35         1485      "5"           "5"             "5"            "555"
…        …          …         …             …               …              …
13       28         1730      "5"           "5"             "5"            "555"
6        25         1376      "5"           "5"             "5"            "555"
2        23         2092      "5"           "5"             "5"            "555"
6        23         1518      "5"           "5"             "5"            "555"
3        23         1282      "5"           "5"             "5"            "555"

Mean R, F, and M for the most valuable group.

In [139]:
best_customers.select(
    pl.col("Recency").mean(), pl.col("Frequency").mean(), pl.col("Monetary").mean()
)
Out[139]:
shape: (1, 3)
Recency   Frequency  Monetary
9.409091  26.772727  1545.772727

What share of all customers does this group represent?

In [140]:
print("Best customers: {:.3%}".format(len(best_customers) / len(rfm_df)))
Best customers: 0.993%

Split customers into loyal regulars (常貴客), previous customers (先前客), new customers (新顧客), and churned customers (流失客).

In [141]:
rfm_df = rfm_df.with_columns(
    RecencyLevel=pl.col("Recency").qcut(2, labels=["Low", "High"]),
    FrequencyLevel=pl.col("Frequency").cut([1], labels=["Low", "High"]),
)
In [142]:
rfm_df = rfm_df.with_columns(
    pl.when((pl.col("RecencyLevel") == "Low") & (pl.col("FrequencyLevel") == "Low"))
    .then(pl.lit("流失客"))
    .when((pl.col("RecencyLevel") == "Low") & (pl.col("FrequencyLevel") == "High"))
    .then(pl.lit("新顧客"))
    .when((pl.col("RecencyLevel") == "High") & (pl.col("FrequencyLevel") == "Low"))
    .then(pl.lit("先前客"))
    .when((pl.col("RecencyLevel") == "High") & (pl.col("FrequencyLevel") == "High"))
    .then(pl.lit("常貴客"))
    .alias("CustomerLevel")
)
In [143]:
rfm_df
Out[143]:
shape: (2_216, 10)
Recency  Frequency  Monetary  R/F/M scores  CustomerScore  RecencyLevel  FrequencyLevel  CustomerLevel
58       25         1617      3/5/5         "355"          High          High            常貴客
38       6          27        4/1/1         "411"          Low           High            新顧客
26       21         776       4/4/4         "444"          Low           High            新顧客
26       8          53        4/2/1         "421"          Low           High            新顧客
94       19         422       1/4/3         "143"          High          High            常貴客
…
46       18         1341      3/4/5         "345"          Low           High            新顧客
56       22         444       3/4/3         "343"          High          High            常貴客
91       19         1241      1/4/5         "145"          High          High            常貴客
8        23         843       5/5/4         "554"          Low           High            新顧客
40       11         172       3/2/2         "322"          Low           High            新顧客
In [144]:
rfm_df.plot.bar(alt.X("count()", title="Count"), alt.Y("CustomerLevel", title="Customer level")).properties(title="Customer level distribution")
Out[144]:

K-means

K-means clustering with a chosen number of clusters. Standardize the features first: before K-means it is best to standardize, otherwise features with large values dominate the distances and pull the clustering toward them.

In [145]:
from sklearn.preprocessing import StandardScaler
In [146]:
scaler = StandardScaler()

rfm_df_numeric = rfm_df.select("Recency", "Frequency", "Monetary")

normalized_df = pl.DataFrame(
    scaler.fit_transform(rfm_df_numeric.to_numpy()), schema=rfm_df_numeric.columns
)
In [147]:
normalized_df
Out[147]:
shape: (2_216, 3)
Recency    Frequency  Monetary
0.310532   1.319446   1.675488
-0.380509  -1.157987  -0.962358
-0.795134  0.797881   0.28025
-0.795134  -0.897205  -0.919224
1.554407   0.537099   -0.307044
…          …          …
-0.104093  0.406708   1.217598
0.241428   0.928273   -0.270546
1.450751   0.537099   1.051696
-1.417072  1.058664   0.391404
-0.311405  -0.506031  -0.7218
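As a sanity check, the z-score transform that StandardScaler applies is just (x − mean) / std per column; a minimal hand computation (toy values, not the report's data):

```python
# z-score by hand: (x - mean) / std, using the population std
# (ddof = 0), which matches sklearn's StandardScaler.
def zscore(xs):
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

incomes = [30000.0, 50000.0, 70000.0]  # large-scale feature
visits = [2.0, 5.0, 8.0]               # small-scale feature

print(zscore(incomes))  # [-1.2247..., 0.0, 1.2247...]
print(zscore(visits))   # identical after standardization, despite the scale gap
```

After standardization both features contribute comparably to K-means distances.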

Run K-means, using the silhouette score to decide the best clustering.

An earlier version used the elbow method, but it did not work well, so the elbow plot is only drawn for reference and silhouette_score is used instead.^1
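For reference, the silhouette coefficient of a single point is s = (b − a) / max(a, b), where a is its mean distance to its own cluster and b its mean distance to the nearest other cluster; silhouette_score averages this over all points. A tiny 1-D sketch with made-up values:

```python
# One point's silhouette coefficient, computed by hand (toy 1-D data).
own_cluster = [1.0, 2.0]     # the point 1.0 plus one clustermate
other_cluster = [8.0, 9.0]   # nearest other cluster

point = own_cluster[0]
a = sum(abs(point - x) for x in own_cluster[1:]) / len(own_cluster[1:])
b = sum(abs(point - x) for x in other_cluster) / len(other_cluster)
s = (b - a) / max(a, b)
print(s)  # ~0.867: close to 1, so the point sits tightly in its own cluster
```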

In [148]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def find_best_kmeans(normalized_df: pl.DataFrame) -> tuple[KMeans, list[float]]:
    best_kmeans: KMeans | None = None
    silhouette_avg_max = 0
    inertias = list[float]()

    for n in range(3, 12):
        kmeans = KMeans(n_clusters=n, random_state=6).fit(normalized_df)
        silhouette_avg = silhouette_score(normalized_df.to_numpy(), kmeans.labels_, random_state=6)

        inertias.append(kmeans.inertia_)

        picked = silhouette_avg > silhouette_avg_max

        if picked:
            best_kmeans = kmeans
            silhouette_avg_max = silhouette_avg

        print(f"n_clusters: {n}, silhouette_score: {silhouette_avg}, picked: {picked}")

    assert best_kmeans is not None
    return best_kmeans, inertias

kmeans, inertias = find_best_kmeans(normalized_df)
n_clusters: 3, silhouette_score: 0.382027566450604, picked: True
n_clusters: 4, silhouette_score: 0.3772275072151424, picked: False
n_clusters: 5, silhouette_score: 0.36749151515577905, picked: False
n_clusters: 6, silhouette_score: 0.32933115669120455, picked: False
n_clusters: 7, silhouette_score: 0.3316218205019696, picked: False
n_clusters: 8, silhouette_score: 0.3413972324456658, picked: False
n_clusters: 9, silhouette_score: 0.3356077105119019, picked: False
n_clusters: 10, silhouette_score: 0.336612138016919, picked: False
n_clusters: 11, silhouette_score: 0.33650869759565205, picked: False
In [149]:
alt.Chart(pl.DataFrame({"n_clusters": list(range(3, 12)), "inertia": inertias})).mark_line(
    point=True
).encode(alt.X("n_clusters", title="n_clusters"), alt.Y("inertia", title="inertia")).properties(
    title="Elbow Method for Optimal K"
)
Out[149]:

Label each row with its cluster.

In [150]:
rfm_df = rfm_df.with_columns(Cluster=pl.Series(kmeans.labels_))
In [151]:
rfm_df
Out[151]:
shape: (2_216, 11)
[Same rows as the rfm_df preview above, with the new Cluster column appended: 2, 1, 0, 1, 2, …, 0, 2, 2, 0, 1.]

Find each cluster's center and plot the centers as a scatter chart.

In [152]:
kmeans_center = pl.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), schema=normalized_df.schema
).with_row_index()
In [153]:
kmeans_center
Out[153]:
shape: (3, 4)
index  Recency    Frequency  Monetary
0      21.95073   21.286496  1047.507299
1      49.199255  8.147114   109.060521
2      73.641414  21.146465  1101.20202
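inverse_transform here simply undoes the z-score, x = z · std + mean, per column; a one-line check with made-up numbers (not the fitted scaler's actual parameters):

```python
# Undoing a z-score: x = z * std + mean (illustrative numbers).
mean, std = 49.0, 29.0   # pretend these are Recency's fitted mean and std
z_center = 0.85          # a cluster centre on the standardized scale
x_center = z_center * std + mean
print(x_center)          # back on the original Recency scale
```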
In [154]:
kmeans_center.plot.scatter(x="Frequency", y="Monetary", size="Recency", color="index:N").properties(
    title="Frequency vs Monetary vs Recency"
)
Out[154]:

Cluster 2 is relatively the most valuable. List cluster 2's mean R, F, and M.

In [155]:
best_customers = rfm_df.filter(pl.col("Cluster") == 2)

best_customers.select("Recency", "Frequency", "Monetary").mean()
Out[155]:
shape: (1, 3)
Recency    Frequency  Monetary
73.555369  21.172819  1101.409396

Look at the RFM-label distribution within this cluster.

In [156]:
best_customers.plot.bar(alt.X("count()", title="Count"), alt.Y("CustomerLevel", title="Customer level")).properties(title="Customer level distribution")
Out[156]:

Cluster Profiles

Cluster sizes.

In [157]:
rfm_df.plot.arc(theta=alt.Theta("count()", title="Count"), color=alt.Color("Cluster:N", title="Cluster")).properties(title="Cluster distribution")
Out[157]:

Each cluster's spending, income, and mean days since the last visit.

In [158]:
columns_to_include = [
    'MntWines',
    'MntFruits',
    'MntMeatProducts',
    'MntFishProducts',
    'MntSweetProducts',
    'MntGoldProds',
    'Income',
    'Recency',
    "Cluster",
]


df = df.with_columns(Cluster=rfm_df["Cluster"])


cluster_summary = df.select(columns_to_include).group_by("Cluster").mean().sort("Cluster")
In [159]:
cluster_summary
Out[159]:
shape: (3, 9)
Cluster  MntWines    MntFruits  MntMeatProducts  MntFishProducts  MntSweetProducts  MntGoldProds  Income        Recency
0        529.065934  46.641026  295.289377       63.498168        44.818681         67.771062     66794.153846  21.855311
1        47.638734   5.731844   25.085661        8.529795         5.610801          16.463687     36460.982309  49.199255
2        563.840604  44.937919  305.189597       66.399329        49.327181         71.714765     67367.795302  73.555369

What each cluster buys.

In [160]:
product_sales = (
    df.group_by("Cluster")
    .agg(
        Wines=pl.sum("MntWines"),
        Fruits=pl.sum("MntFruits"),
        MeatProducts=pl.sum("MntMeatProducts"),
        FishProducts=pl.sum("MntFishProducts"),
        SweetProducts=pl.sum("MntSweetProducts"),
        GoldProds=pl.sum("MntGoldProds"),
    )
    .sort("Cluster")
    .drop("Cluster")
    .transpose(
        include_header=True,
        header_name="Product",
        column_names=["Cluster 0", "Cluster 1", "Cluster 2"],
))
In [161]:
(
    product_sales.plot.bar(alt.X("Cluster 0"), alt.Y("Product"))
    & product_sales.plot.bar(alt.X("Cluster 1"), alt.Y("Product"))
    & product_sales.plot.bar(alt.X("Cluster 2"), alt.Y("Product"))
)
Out[161]:

Each cluster's campaign participation.

In [162]:
product_sales = (
    df.group_by("Cluster")
    .agg(
        pl.sum("AcceptedCmp1"),
        pl.sum("AcceptedCmp2"),
        pl.sum("AcceptedCmp3"),
        pl.sum("AcceptedCmp4"),
        pl.sum("AcceptedCmp5"),
    )
    .sort("Cluster")
    .drop("Cluster")
    .transpose(
        include_header=True,
        header_name="Product",
        column_names=["Cluster 0", "Cluster 1", "Cluster 2"],
))
In [163]:
(
    product_sales.plot.bar(alt.X("Cluster 0"), alt.Y("Product"))
    & product_sales.plot.bar(alt.X("Cluster 1"), alt.Y("Product"))
    & product_sales.plot.bar(alt.X("Cluster 2"), alt.Y("Product"))
)
Out[163]:

Number of children in each cluster.

In [164]:
df.group_by("Cluster").agg(pl.sum("Kidhome"), pl.sum("Teenhome")).sort("Cluster").unpivot(
    index="Cluster"
).plot.bar("Cluster:N", "value", xOffset="variable", color="variable")
Out[164]:

Income for each cluster.

In [165]:
df.group_by("Cluster").agg(pl.sum("Income")).sort("Cluster").plot.scatter("Income", "Cluster:N")
Out[165]:

Education level in each cluster.

In [166]:
df.select("Cluster", "Education").plot.bar(
    alt.X("Cluster:N"),
    alt.Y("count()", title="Count"),
    "Education",
    xOffset="Education",
)
Out[166]:

Classification and Prediction

What is each cluster's probability of participating in the next campaign?

In [167]:
normalized_df = df.with_columns(Cluster=rfm_df["Cluster"])
In [168]:
normalized_df
Out[168]:
shape: (2_216, 34)
[Preview of df with the Cluster column appended (first rows: 2, 1, 0, 1, 2).]

First convert the discrete columns, such as Education and Marital_Status, to one-hot encoding.

In [169]:
one_hot_df = df.select(
    "Education",
    "Marital_Status",
    "Kidhome",
    "Teenhome",
).to_dummies()
In [170]:
one_hot_df
Out[170]:
shape: (2_216, 19)
[Preview of the 19 one-hot columns (Education_*, Marital_Status_*, Kidhome_*, Teenhome_*), each a 0/1 indicator.]
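What to_dummies does can be sketched in plain Python: each distinct category becomes its own 0/1 indicator column (an illustrative helper, not the polars API):

```python
# Minimal one-hot encoder sketch: one 0/1 column per distinct category.
def one_hot(values):
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

print(one_hot(["Single", "Married", "Single"]))
# -> {'Married': [0, 1, 0], 'Single': [1, 0, 1]}
```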

Standardize the continuous columns.

In [171]:
scaler = StandardScaler()

fields = ["Income", "Recency", "TotalPurchases", "TotalSales", "age"]
continuous_df = df.select(fields)
normalized_continuous_df = pl.DataFrame(
    scaler.fit_transform(continuous_df.to_numpy()), schema=fields
)
In [172]:
normalized_continuous_df
Out[172]:
shape: (2_216, 5)
[Preview of the standardized Income, Recency, TotalPurchases, TotalSales, and age columns.]

Combine the standardized data, the one-hot columns, and the K-means cluster labels.

In [173]:
predictable_df = pl.concat([
    normalized_continuous_df,
    one_hot_df,
    df.select("Cluster", "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4"),
    ],
    how="horizontal",
)
In [174]:
predictable_df
Out[174]:
shape: (2_216, 29)
[Preview of the 29 model-ready columns: the five standardized continuous features, the 19 one-hot columns, Cluster, and AcceptedCmp1–4.]
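The section stops after assembling predictable_df. A plausible next step, sketched here on synthetic stand-in data (the real run would use predictable_df as features and df["Response"] as the label), is to fit a classifier and average its predicted probabilities per cluster:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the prepared features, labels, and cluster ids.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))                          # standardized features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
clusters = rng.integers(0, 3, size=200)                # pretend K-means labels

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]                   # P(response = 1) per row

# Average predicted response probability within each cluster.
for c in range(3):
    print(f"cluster {c}: mean predicted response rate = {proba[clusters == c].mean():.3f}")
```

On the real data, grouping proba by the K-means clusters would answer the section's question directly.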